Corpus Statistics Meet the Noun Compound: 
Some Empirical Results 

Mark Lauer 

Microsoft Institute 
65 Epping Road, 
North Ryde NSW 2113 

Australia 
t-markl@microsof t . com 



Abstract 

A variety of statistical methods for noun 
compound analysis are implemented and 
compared. The results support two main 
conclusions. First, the use of conceptual 
association not only enables a broad cover- 
age, but also improves the accuracy. Sec- 
ond, an analysis model based on depen- 
dency grammar is substantially more ac- 
curate than one based on deepest con- 
stituents, even though the latter is more 
prevalent in the literature. 

1 Background 

1.1 Compound Nouns 

If parsing is taken to be the first step in taming 
the natural language understanding task, then broad 
coverage NLP remains a jungle inhabited by wild 
beasts. For instance, parsing noun compounds ap- 
pears to require detailed world knowledge that is 
unavailable outside a limited domain (Sparck Jones, 
1983). Yet, far from being an obscure, endan- 
gered species, the noun compound is flourishing in 
modern language. It has already made five ap- 
pearances in this paragraph and at least one di- 
achronic study shows a veritable population explo- 
sion (Leonard, 1984). While substantial work on 
noun compounds exists in both linguistics (e.g. Levi, 
1978; Ryder, 1994) and computational linguistics 
(Finin, 1980; McDonald, 1982; Isabelle, 1984), tech- 
niques suitable for broad coverage parsing remain 
unavailable. This paper explores the application 
of corpus statistics (Charniak, 1993) to noun com- 
pound parsing (other computational problems are 
addressed in Arens et al, 1987; Vanderwende, 1993 
and Sproat, 1994). 

The task is illustrated in example 0: 

Example 1 



(a) [womauN [aidN workerN]] 

(b) [[hydrogeuN Ioun] exchangCN] 

The parses assigned to these two compounds dif- 
fer, even though the sequence of parts of speech are 
identical. The problem is analogous to the prepo- 
sitional phrase attachment task explored in Hindle 
and Rooth (1993). The approach they propose in- 
volves computing lexical associations from a corpus 
and using these to select the correct parse. A similar 
architecture may be applied to noun compounds. 

In the experiments below the accuracy of such a 
system is measured. Comparisons are made across 
five dimensions: 

• Each of two analysis models are applied: adja- 
cency and dependency. 

• Each of a range of training schemes are em- 
ployed. 

• Results are computed with and without tuning 
factors suggested in the literature. 

• Each of two parameterisations are used: associ- 
ations between words and associations between 
concepts. 

• Results are collected with and without machine 
tagging of the corpus. 

1.2 Training Schemes 

While Hindle and Rooth (1993) use a partial parser 
to acquire training data, such machinery appears un- 
necessary for noun compounds. Brent (1993) has 
proposed the use of simple word patterns for the ac- 
quisition of verb subcategorisation information. An 
analogous approach to compounds is used in Lauer 
(1994) and constitutes one scheme evaluated below. 
While such patterns produce false training exam- 
ples, the resulting noise often only introduces minor 
distortions. 



A more liberal alternative is the use of a co- 
occurrence window. Yarowsky (1992) uses a fixed 
100 word window to collect information used for 
sense disambiguation. Similarly, Smadja (1993) uses 
a six content word window to extract significant col- 
locations. A range of windowed training schemes are 
employed below. Importantly, the use of a window 
provides a natural means of trading off the amount 
of data against its quality. When data sparsencss un- 
dermines the system accuracy, a wider window may 
admit a sufficient volume of extra accurate data to 
outweigh the additional noise. 

1.3 Noun Compound Analysis 

There are at least four existing corpus-based al- 
gorithms proposed for syntactically analysing noun 
compounds. Only two of these have been subjected 
to evaluation, and in each case, no comparison to 
any of the other three was performed. In fact all au- 
thors appear unaware of the other three proposals. 
I will therefore briefly describe these algorithms. 

Three of the algorithms use what I will call the 
ADJACENCY MODEL, an analysis procedure that goes 
back to Marcus (1980, p253). Therein, the proce- 
dure is stated in terms of calls to an oracle which 
can determine if a noun compound is acceptable. It 
is reproduced here for reference: 

Given three nouns tii, n2 and n^: 

• If cither [ni or [712 n^] is not semantically 
acceptable then build the alternative structure; 

• otherwise, if [n2 n^] is semantically preferable 
to [ni 712] then build [n2 n^]', 

• otherwise, build [ni 712]. 

Only more recently has it been suggested that cor- 
pus statistics might provide the oracle, and this idea 
is the basis of the three algorithms which use the 
adjacency model. The simplest of these is reported 
in Pustejovsky et al (1993). Given a three word 
compound, a search is conducted elsewhere in the 
corpus for each of the two possible subcomponents. 
Whichever is found is then chosen as the more closely 
bracketed pair. For example, when backup compiler 
disk is encountered, the analysis will be: 

Example 2 

(a) [backupN [compilerN diskN]] 

when compiler disk appears elsewhere 

(b) [[backupN compilerN] diskN] 

when backup compiler appears elsewhere 

Since this is proposed merely as a rough heuristic, 
it is not stated what the outcome is to be if neither 



or both subcomponents appear. Nor is there any 
evaluation of the algorithm. 

The proposal of Liberman and Sproat (1992) is 
more sophisticated and allows for the frequency of 
the words in the compound. Their proposal in- 
volves comparing the mutual information between 
the two pairs of adjacent words and bracketing to- 
gether whichever pair exhibits the highest. Again, 
there is no evaluation of the method other than a 
demonstration that four examples work correctly. 

The third proposal based on the adjacency model 
appears in Rcsnik (1993) and is rather more complex 
again. The SELECTIONAL ASSOCIATION between a 
predicate and a word is defined based on the con- 
tribution of the word to the conditional entropy of 
the predicate. The association between each pair of 
words in the compound is then computed by taking 
the maximum selectional association from all pos- 
sible ways of regarding the pair as predicate and 
argument. Whilst this association metric is com- 
plicated, the decision procedure still follows the out- 
hne devised by Marcus (1980) above. Resnik (1993) 
used unambiguous noun compounds from the parsed 
Wall Street Journal (WSJ) corpus to estimate the 
association values and analysed a test set of around 
160 compounds. After some tuning, the accuracy 
was about 73%, as compared with a baseline of 64% 
achieved by always bracketing the first two nouns 
together. 

The fourth algorithm, first described in Lauer 
(1994), differs in one striking manner from the other 
three. It uses what I will call the dependency 
MODEL. This model utilises the following procedure 
when given three nouns ni, ?i2 and n^: 

• Determine how acceptable the structures [ni 112] 
and [ni 77.3] are; 

• if the latter is more acceptable, build [ri2 713] 
first; 

• otherwise, build [ni 712] first. 

Figure ^ shows a graphical comparison of the two 
analysis models. 

In Lauer (1994), the degree of acceptability is 
again provided by statistical measures over a cor- 
pus. The metric used is a mutual information- like 
measure based on probabilities of modification rela- 
tionships. This is derived from the idea that parse 
trees capture the structure of semantic relationships 
within a noun compound.^ 

^Lauer and Dras (1994) give a formal construction 
motivating the algorithm given in Lauer (1994). 
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Figure 1: Two analysis models and the associations they compare 



The dependency model attempts to choose a parse 
which makes the resulting relationships as accept- 
able as possible. For example, when backup compiler 
disk is encountered, the analysis will be: 

Example 3 

(a) [backupN [compilerN diskw]] 

when backup disk is more acceptable 

(b) [[backupN compiler^] diskN] 

when backup compiler is more acceptable 

I claim that the dependency model makes more in- 
tuitive sense for the following reason. Consider the 
compound calcium ion exchange, which is typically 
left-branching (that is, the first two words are brack- 
eted together). There does not seem to be any rea- 
son why calcium ion should be any more frequent 
than ion exchange. Both are plausible compounds 
and regardless of the bracketing, ions are the object 
of an exchange. Instead, the correct parse depends 
on whether calcium characterises the ions or medi- 
ates the exchange. 

Another significant difference between the mod- 
els is the predictions they make about the propor- 
tion of left and right-branching compounds. Lauer 
and Dras (1994) show that under a dependency 
model, left-branching compounds should occur twice 
as often as right-branching compounds (that is two- 
thirds of the time). In the test set used here and 
in that of Resnik (1993), the proportion of left- 
branching compounds is 67% and 64% respectively. 
In contrast, the adjacency model appears to predict 
a proportion of 50%. 

The dependency model has also been proposed by 
Kobayasi et al (1994) for analysing Japanese noun 
compounds, apparently independently. Using a cor- 
pus to acquire associations, they bracket sequences 
of Kanji with lengths four to six (roughly equiva- 
lent to two or three words). A simple calculation 
shows that using their own preprocessing hueristics 
to guess a bracketing provides a higher accuracy on 
their test set than their statistical model does. This 
renders their experiment inconclusive. 



2 Method 

2.1 Extracting a Test Set 

A test set of syntactically ambiguous noun com- 
pounds was extracted from our 8 million word 
Grolier's encyclopedia corpus in the following way.^ 
Because the corpus is not tagged or parsed, a some- 
what conservative strategy of looking for unambigu- 
ous sequences of nouns was used. To distinguish 
nouns from other words, the University of Penn- 
sylvania morphological analyser (described in Karp 
et al, 1992) was used to generate the set of words 
that can only be used as nouns (I shall henceforth 
call this set M). All consecutive sequences of these 
words were extracted, and the three word sequences 
used to form the test set. For reasons made clear 
below, only sequences consisting entirely of words 
from Roget's thesaurus were retained, giving a total 
of 308 test triples.0 

These triples were manually analysed using as 
context the entire article in which they appeared. In 
some cases, the sequence was not a noun compound 
(nouns can appear adjacent to one another across 
various constituent boundaries) and was marked as 
an error. Other compounds exhibited what Kin- 
dle and Rooth (1993) have termed semantic in- 
determinacy where the two possible bracketings 
cannot be distinguished in the context. The remain- 
ing compounds were assigned either a left-branching 
or right-branching analysis. Table |l| shows the num- 
ber of each kind and an example of each. 

Accuracy figures in all the results reported be- 
low were computed using only those 244 compounds 
which received a parse. 

2.2 Conceptual Association 

One problem with applying lexical association to 
noun compounds is the enormous number of param- 

^We would like to thank Grolier's for permission to 
use this material for research purposes. 

■'The 1911 version of Roget's used is available on-line 
and is in the public domain. 



Type 


Number 


Proportion 


Example 


Error 


29 


9% 


In monsoon regions rainfall does not . . . 


Indeterminate 


35 


11% 


Most advanced aircraft have precision navigation systems. 


Left-branching 


163 


53% 


. . . escaped punishment by the AUied war crimes tribunals. 


Right-branching 


81 


26% 


Ronald Reagan, who won two landslide election victories^ . . . 



Table 1: Test set distribution 



eters required, one for every possible pair of nouns. 
Not only does this require a vast amount of memory 
space, it creates a severe data sparseness problem 
since we require at least some data about each pa- 
rameter. Resnik and Hearst (1993) coined the term 
CONCEPTUAL ASSOCIATION to refer to association 
values computed between groups of words. By as- 
suming that all words within a group behave simi- 
larly, the parameter space can be built in terms of 
the groups rather than in terms of the words. 

In this study, conceptual association is used with 
groups consisting of all categories from the 1911 ver- 
sion of Roget's thesaurus.^ Given two thesaurus cat- 
egories ti and t2, there is a parameter which rep- 
resents the degree of acceptability of the structure 
[71,1712] where rii is a noun appearing in ti and n2 
appears in t2 . By the assumption that words within 
a group behave similarly, this is constant given the 
two categories. Following Lauer and Dras (1994) we 
can formally write this parameter as Pr{ti — * ^2) 
where the event ti — > ^2 denotes the modification of 
a noun in t2 by a noun in ii. 

2.3 Training 

To ensure that the test set is disjoint from the train- 
ing data, all occurrences of the test noun compounds 
have been removed from the training corpus. Two 
types of training scheme are explored in this study, 
both unsupervised. The first employs a pattern that 
follows Pustejovsky (1993) in counting the occur- 
rences of subcomponents. A training instance is any 
sequence of four words wi W2W3W4 where wi , W4 ^ N 
and W2, W3 G A/". Let countp(ni, 7^2) be the number 
of times a sequence winin2Wi occurs in the training 
corpus with wi,W4 ^ Af. 

The second type uses a window to collect train- 
ing instances by observing how often a pair of nouns 
co-occur within some fixed number of words. In this 
study, a variety of window sizes are used. For n > 2, 
let count„(ni, 712) be the number of times a sequence 
7iiw;i . . . Win2 occurs in the training corpus where 
i < 71 — 2. Note that windowed counts are asym- 
metric. In the case of a window two words wide, 
this yields the mutual information metric proposed 



by Liberman and Sproat (1992). 

Using each of these different training schemes to 
arrive at appropriate counts it is then possible to 
estimate the parameters. Since these are expressed 
in terms of categories rather than words, it is nec- 
essary to combine the counts of words to arrive at 
estimates. In all cases the estimates used are: 



Pr(<i 



rj ^-^ 



U'2G*2 



count (wi, W2) 
ambig(7(;i)ambig(7(;2) 



where 7y = ^ 



count (wi, W2) 
wxeAf ambig(7(;i)ambig(7(;2) 

Here ambig{'w) is the number of categories in 
which w appears. It has the effect of dividing the 
evidence from a training instance across all possi- 
ble categories for the words. The normaliser ensures 
that all parameters for a head noun sum to unity. 

2.4 Analysing the Test Set 



Given the high level descriptions in section |L3 



remains only to formalise the decision process used 
to analyse a noun compound. Each test compound 
presents a set of possible analyses and the goal is 
to choose which analysis is most likely. For three 
word compounds it suffices to compute the ratio of 
two probabilities, that of a left-branching analysis 
and that of a right-branching one. If this ratio is 
greater than unity, then the left-branching analy- 
sis is chosen. When it is less than unity, a right- 
branching analysis is chosen.^ If the ratio is exactly 
unity, the analyser guesses left-branching, although 
this is fairly rare for conceptual association as shown 
by the experimental results below. 

For the adjacency model, when the given com- 
pound is wiW2W^, we can estimate this ratio as: 



ti Gcats(-u;j) 



Pr(<i ^ t2) 



Et,ecats(»,) Pl'(^2 ts) 

For the dependency model, the ratio is: 



(1) 



It contains 1043 categories. 



If either probability estimate is zero, the other anal- 
ysis is chosen. If both are zero the analysis is made as if 
the ratio were exactly unity. 



In both cases, we sum over all possible categories 
for the words in the compound. Because the de- 
pendency model equations have two factors, they 
are affected more severely by data sparseness. If 
the probability estimate for Pr(i2 — > is) is zero for 
all possible categories t2 and then both the nu- 
merator and the denominator will be zero. This 
will conceal any preference given by the parame- 
ters involving ti. In such cases, we observe that the 
test instance itself provides the information that the 
event — *■ ta can occur and we recalculate the ra- 
tio using Pr(t2 ^ ^3) — k for all possible categories 
^2, ^3 where k is any non-zero constant. However, no 
correction is made to the probability estimates for 
Pr(ii 12) and Pr(ii t^) for unseen cases, thus 
putting the dependency model on an equal footing 
with the adjacency model above. 

The equations presented above for the dependency 
model differ from those developed in Lauer and Dras 
(1994) in one way. There, an additional weight- 
ing factor (of 2.0) is used to favour a left-branching 
analysis. This arises because their construction is 
based on the dependency model which predicts that 
left-branching analyses should occur twice as often. 
Also, the work reported in Lauer and Dras (1994) 
uses simplistic estimates of the probability of a word 
given its thesaurus category. The equations above 
assume these probabilities are uniformly constant. 



Section 3.2 below shows the result of making these 



two additions to the method. 
3 Results 

3.1 Dependency meets Adjacency 

Eight different training schemes have been used to 
estimate the parameters and each set of estimates 
used to analyse the test set under both the adjacency 
and the dependency model. The schemes used are: 



the pattern given in section 2.3; and 



• windowed training schemes with window widths 
of 2, 3, 4, 5, 10, 50 and 100 words. 

The accuracy on the test set for all these exper- 
iments is shown in figure ^. As can be seen, the 
dependency model is more accurate than the adja- 
cency model. This is true across the whole spec- 
trum of training schemes. The proportion of cases 
in which the procedure was forced to guess, either 
because no data supported either analysis or because 



both were equally supported, is quite low. For the 
pattern and two- word window training schemes, the 
guess rate is less than 4% for both models. In the 
three-word window training scheme, the guess rates 
are less than 1%. For all larger windows, neither 
model is ever forced to guess. 

In the case of the pattern training scheme, the 
difference between 68.9% for adjacency and 77.5% 
for dependency is statistically significant at the 5% 
level {p = 0.0316), demonstrating the superiority of 
the dependency model, at least for the compounds 
within Grolier's encyclopedia. 

In no case do any of the windowed training 
schemes outperform the pattern scheme. It seems 
that additional instances admitted by the windowed 
schemes are too noisy to make an improvement. 

Initial results from applying these methods to the 
EMA corpus have been obtained by Wilco ter Stal 
(1995), and support the conclusion that the depen- 
dency model is superior to the adjacency model. 

3.2 Tuning 

Lauer and Dras (1994) suggest two improvements to 
the method used above. These are: 

• a factor favouring left-branching which arises 
from the formal dependency construction; and 

• factors allowing for naive estimates of the vari- 
ation in the probability of categories. 

While these changes are motivated by the depen- 
dency model, I have also applied them to the adja- 
cency model for comparison. To implement them, 
equations |l| and ^ must be modified to incorporate 



a factor of 



in each term of the sum and 



|tl|l*2llt3l 

the entire ratio must be multiplied by two. Five 
training schemes have been applied with these ex- 
tensions. The accuracy results are shown in figure^. 
For comparison, the untuned accuracy figures are 
shown with dotted lines. A marked improvement is 
observed for the adjacency model, while the depen- 
dency model is only slightly improved. 

3.3 Lexical Association 

To determine the difference made by conceptual as- 
sociation, the pattern training scheme has been re- 
trained using lexical counts for both the dependency 
and adjacency model, but only for the words in 
the test set. If the same system were to be ap- 
plied across all of M (a total of 90,000 nouns), then 
around 8.1 billion parameters would be required. 
Left-branching is favoured by a factor of two as de- 
scribed in the previous section, but no estimates 
for the category probabilities are used (these being 
meaningless for the lexical association method). 



85 
80 
75 

Accuracy 70 
(%) 65 
60 
55 
50 



1 1 1 1 1 


1 1 1 
Dependency Model -• 

Adjacency Model -e — 
Guess Left ■ ■ ■ - 

• • 








1 1 1 1 1 


1 — — ^ * 


Pattern 2 3 4 5 


10 50 100 


Training scheme (integers 


denote window widths) 



Figure 2: Accuracy of dependency and adjacency model for various training schemes 



85 
80 
75 - 

Accuracy 70 - 
(%) 65 - 
60 - 
55 
50 



Tuned Dependency 
Tuned Adjacency 




Pattern 



10 



Training scheme (integers denote window widths) 
Figure 3: Accuracy of tuned dependency and adjacency model for various training schemes 



Accuracy and guess rates are shown in figure ^. 
Conceptual association outperforms lexical associa- 
tion, presumably because of its ability to generalise. 

3.4 Using a Tagger 

One problem with the training methods given in sec- 
tion |2.3| is the restriction of training data to nouns 
in Af. Many nouns, especially common ones, have 
verbal or adjectival usages that preclude them from 
being in J\f. Yet when they occur as nouns, they 
still provide useful training information that the cur- 
rent system ignores. To test whether using tagged 
data would make a difference, the freely available 
Brill tagger (Brill, 1993) was applied to the corpus. 
Since no manually tagged training data is available 
for our corpus, the tagger's default rules were used 
(these rules were produced by Brill by training on 
the Brown corpus). This results in rather poor tag- 
ging accuracy, so it is quite possible that a manually 
tagged corpus would produce better results. 

Three training schemes have been used and the 
tuned analysis procedures applied to the test set. 
Figure ^ shows the resulting accuracy, with accuracy 
values from figure |^ displayed with dotted lines. If 
anything, admitting additional training data based 
on the tagger introduces more noise, reducing the 
accuracy. However, for the pattern training scheme 
an improvement was made to the dependency model, 
producing the highest overall accuracy of 81%. 

4 Conclusion 

The experiments above demonstrate a number of im- 
portant points. The most general of these is that 
even quite crude corpus statistics can provide infor- 
mation about the syntax of compound nouns. At the 
very least, this information can be applied in broad 
coverage parsing to assist in the control of search. I 
have also shown that with a corpus of moderate size 
it is possible to get reasonable results without using 
a tagger or parser by employing a customised train- 
ing pattern. While using windowed co-occurrence 
did not help here, it is possible that under more 
data sparse conditions better performance could be 
achieved by this method. 

The significance of the use of conceptual associ- 
ation deserves some mention. I have argued that 
without it a broad coverage system would be impos- 
sible. This is in contrast to previous work on concep- 
tual association where it resulted in little improve- 
ment on a task which could already be performed. 
In this study, not only has the technique proved its 
worth by supporting generality, but through gen- 
eralisation of training information it outperforms 



the equivalent lexical association approach given the 
same information. 

Amongst all the comparisons performed in these 
experiments one stands out as exhibiting the great- 
est contrast. In all experiments the dependency 
model provides a substantial advantage over the ad- 
jacency model, even though the latter is more preva- 
lent in proposals within the literature. This result is 
in accordance with the informal reasoning given in 
section 1.3. The model also has the further commen- 



dation that it predicts correctly the observed pro- 
portion of left-branching compounds found in two 
independently extracted test sets. 

In all, the most accurate technique achieved an ac- 
curacy of 81% as compared to the 67% achieved by 
guessing left-branching. Given the high frequency of 
occurrence of noun compounds in many texts, this 
suggests that the use of these techniques in proba- 
bilistic parsers will result in higher performance in 
broad coverage natural language processing. 
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